Biology Methods and Protocols
Oxford University Press (OUP)
All preprints, ranked by how well they match the content profile of Biology Methods and Protocols, based on 53 papers previously published here. The average preprint has a 0.08% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Wasielewski, E.; Karim, B.; Sulpice, L.; Pecot, T.
Purpose: Patient body composition is a major factor in patient management. Indeed, assessment of skeletal muscle index (SMI) and visceral fat area (VFA) and, to a lesser extent, subcutaneous fat area (SFA) is a major factor in patient survival, particularly in surgery. However, to date, there is no simple, rapid, open-access assessment method. The aim of this work is to provide a simple, rapid and accurate tool for assessing patients' body composition. Material and methods: A total of 343 patients underwent liver transplantation at the University Hospital of Rennes between January 1st, 2012 and December 31st, 2018. Image analysis was performed using the open-source software ImageJ. Tissue distinction was based on Hounsfield density. The training dataset used 332 images (320 for training and 12 for validation). The model was evaluated on 11 patients. The complete software and video package is available at https://github.com/tpecot/MuViSS. Results: In total, the model was trained with 332 images and evaluated on 11 images. Model accuracy is 0.974 (SD 0.003); the Jaccard index is 0.98 for visceral fat, 0.895 for muscle and 0.94 for subcutaneous fat. The Dice index is 0.958 (SD 0.003) for visceral fat, 0.944 (SD 0.012) for muscle and 0.970 (SD 0.013) for subcutaneous fat. Finally, the normalized root mean square error is 0.007 for visceral fat, 0.0518 for muscle and 0.0124 for subcutaneous fat. Conclusion: To our knowledge, this is the first freely available model for assessing body composition. The model, based on deep learning, is fast, simple and accurate. Statements and declarations: All authors declare no conflict of interest.
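The overlap metrics reported above can be made concrete with a short sketch. This is a minimal illustration of the Jaccard and Dice indices on tiny hypothetical binary masks, not the paper's MuViSS code:

```python
import numpy as np

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard index (intersection over union) between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient; equals 2J / (1 + J) for Jaccard index J."""
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum())

# Toy 2x3 "segmentation" masks (hypothetical, for illustration only)
pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
true = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
print(jaccard(pred, true))  # 2 shared pixels / 4 in the union = 0.5
print(dice(pred, true))     # 2*2 / (3 + 3) ≈ 0.667
```

In practice each tissue class (visceral fat, muscle, subcutaneous fat) would be scored with its own mask pair.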
Dhaimade, P. A.; Henderson, R.
Background: Large language models (LLMs) have demonstrated rapid advancements in natural language understanding and generation, prompting their integration into biomedical research, clinical practice, and professional education. However, systematic evaluation of LLMs in specialty-specific domains such as dentistry and periodontology remains limited, particularly regarding multidimensional performance metrics. Objective: To conduct a comprehensive, multidimensional assessment of commercially available LLMs (GPT-4.0, GPT-5.0, and Claude SONNET 4.0) on the American Academy of Periodontology in-service examination, focusing on response accuracy, self-assessed confidence calibration, citation validity, and hallucination prevalence. Methods: Models were evaluated on the 2024 AAP In-Service Examination (331 questions) using two formats: Full Test (all questions at once) and Individual Question (one at a time). Prompts were standardized; models selected answers and, for GPT-5.0 and Claude SONNET 4.0, also provided confidence ratings and citations. Citation validity was assessed using a human-in-the-loop protocol with expert review. Statistical analyses included chi-square, McNemar's, and logistic regression to assess accuracy, question fatigue, confidence calibration, and citation reliability. Results: LLMs achieved high overall accuracy (78-87%), with the Individual Question format consistently yielding higher scores than Full Test, though differences were not statistically significant. Accuracy was highest in fact-dense domains (biochemistry, physiology, microbiology) and lowest in integrative domains (diagnosis, therapy). Significant question fatigue was observed in GPT-5.0 Full Test mode (OR = 0.997, p = 0.035), but not in Individual Question mode. Confidence scores predicted accuracy, with the strongest calibration in Individual Question mode. 
Citation analysis revealed frequent hallucinations, mostly critically erroneous, and citation validity was independent of answer accuracy. Conclusions: LLMs can answer a broad spectrum of periodontal specialty questions, but their reliability varies with context and information presentation. While promising as adjunctive tools, their outputs, especially for complex reasoning and citations, require rigorous human review in educational and research settings to ensure accuracy and safety. Author summary: Artificial intelligence chatbots are rapidly entering medical education, yet we lack a comprehensive understanding of their reliability when students depend on them for learning. We developed a multidimensional evaluation framework to systematically assess AI performance beyond simple accuracy, examining how these systems behave across different medical topics, question types, and presentation formats. Using 331 real dental examination questions, we tested three major AI systems, analyzing not only correctness but also confidence calibration (whether AI confidence levels match actual accuracy) and implementing human-in-the-loop verification to check whether cited sources actually exist. Our findings highlight critical vulnerabilities in current AI systems. Most alarmingly, these chatbots fabricated nearly half of their citations while maintaining unwavering confidence in both correct and incorrect responses. This combination of overconfidence and misinformation means students cannot distinguish reliable from unreliable AI responses. Additionally, we documented progressive performance decline during sequential questioning, similar to human cognitive fatigue. While we know AI systems generate rather than retrieve information, our research demonstrates the real-world consequences of this limitation. 
As artificial intelligence integrates into education, healthcare diagnostics, and insurance decisions, these findings underscore the urgent need for better evaluation frameworks and user education about AI limitations.
Olei, S.; Sarwin, G.; Staartjes, V. E.; Zanuttini, L.; Ryu, S.-J.; Regli, L.; Konukoglu, E.; Serra, C.
Introduction: Surgical success hinges on two core factors: technical execution and cognitive planning. While the former can be trained and potentially augmented through robotics, the latter, developing an accurate "mental roadmap" of a given operation, remains complex, deeply individualized and resistant to standardization. In neurosurgery, where minute anatomical distinctions can dictate outcomes, enhancing intraoperative guidance could reduce variability among surgeons and improve global standards. Recent developments in machine vision offer a promising avenue. Previous studies demonstrated that deep learning models could successfully identify anatomical landmarks in highly standardized procedures such as trans-sphenoidal surgery (TSS). However, the applicability of such techniques in more variable and multidimensional intracranial procedures remains unproven. This study investigates whether a deep learning model can recognize key anatomical structures during the more complex pterional trans-sylvian (PTS) approach. Materials and Methods: We developed a deep learning object detection model (YOLOv7x) trained on 5,307 labeled frames from 78 surgical videos of 76 patients undergoing PTS. Surgical steps were standardized, and key anatomical targets were annotated by specifically trained neurosurgical residents and verified by the operating surgeon: frontal/temporal dura, inferior frontal/superior temporal gyri, optic and olfactory nerves, and the internal carotid artery (ICA). Bounding boxes derived from segmentation masks served as training inputs. Performance was evaluated using five-fold cross-validation. Results: The model achieved promising detection performance for deep structures, particularly the optic nerve (AP50: 0.73) and ICA (AP50: 0.67). Superficial structures, like the dura and the cortical gyri, had lower precision (AP50 range: 0.25-0.45), likely due to morphological similarity and optical variability. 
Performance variability across classes reflects the complexity of the anatomical setting along with data limitations. Conclusion: This study shows the feasibility of applying machine vision techniques for anatomical detection in a complex and variable neurosurgical setting. While challenges remain in detecting less distinctive structures, the high accuracy achieved for deep anatomical landmarks validates this approach. These findings mark an essential step towards a machine vision surgical guidance system. Future applications could include real-time anatomical recognition, integration with neuronavigation and the development of AI-supported "surgical roadmaps" to improve intraoperative orientation and global neurosurgical practice.
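The AP50 figures above count a detection as correct when its bounding box overlaps the ground truth with an intersection-over-union (IoU) of at least 0.5. A minimal sketch of that matching criterion, using hypothetical boxes rather than the study's data:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # clamp negative widths/heights: non-overlapping boxes intersect in 0 area
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred = (10, 10, 50, 50)  # hypothetical predicted box
gt = (20, 10, 60, 50)    # hypothetical ground-truth box
iou = box_iou(pred, gt)
print(iou, iou >= 0.5)   # 0.6 True -- counts as a hit at the AP50 threshold
```

AP50 itself then averages precision over recall levels with this 0.5-IoU hit criterion.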
Boelders, S. M.; Nicenboim, B.; Postma, E.; Rutten, G.-J.; Gehring, K.; Ong, S.
Introduction: Cognitive impairments of patients with a glioma are increasingly considered when making treatment decisions, reflecting a personalized onco-functional balance. Predicting cognitive functioning before surgery can serve as a steppingstone for the clinical goal of predicting cognitive functioning after surgery. However, in a previous study, machine-learning models could not reliably predict cognitive functioning before surgery using a comprehensive set of clinical variables. The current study aims to improve predictions while making the uncertainty in individual predictions explicit. Method: Pre-operative cognitive functioning was predicted for 340 patients with a glioma across eight cognitive tests. This was done using six multivariate Bayesian regression models following a machine-learning approach while using a comprehensive set of clinical variables. Four models included interactions with, or a multilevel structure over, histopathological diagnosis. Point-wise predictions were compared using the coefficient of determination (R2) and the best-performing model was interpreted. Results: Bayesian models outperformed machine-learning models and benefited from using shrinkage priors. The R2 ranged between 0.3% and 21.5% with a median across tests of 7.2%. Estimated errors of individual predictions were high. The best-performing model allowed parameters to differ across histopathological diagnoses while pulling them toward the population mean. Conclusion: Bayesian models can improve predictions while providing uncertainty estimates for individual predictions. Despite this, the uncertainty in predictions of pre-operative cognitive functioning using the included clinical variables remained high. Consequently, clinicians should not infer cognitive functioning from these variables. Different histopathological diagnoses are best treated as distinct yet related. Highlights: Bayesian regression outperformed machine-learning models. Predictions were uncertain despite improvements. Different histopathological diagnoses are best treated as distinct yet related. Importance of the study: Cognitive impairments of patients with a glioma are increasingly considered when making treatment decisions, reflecting a personalized onco-functional balance. Predicting cognitive functioning before surgery serves as a steppingstone for the clinical goal of predicting cognitive functioning after surgery. The current study is important for two reasons. First, it demonstrates that Bayesian models can improve predictions of pre-operative cognitive functioning over popular machine-learning models. Second, it explicitly shows that individual predictions of pre-operative cognitive functioning based on a comprehensive set of readily available clinical variables are uncertain. Consequently, clinicians should not infer cognitive functioning from these variables. Last, it shows that prediction models may benefit from a multifaceted view of patients and from treating patients with different histopathological diagnoses as distinct yet related.
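The "pulling toward the population mean" behaviour of the best-performing model can be illustrated with a simple precision-weighted shrinkage estimate. This is a hand-rolled sketch with made-up group means, group sizes, and a made-up pooling constant tau, not the authors' Bayesian regression model:

```python
import numpy as np

# Hypothetical per-diagnosis mean test scores and group sizes (illustrative only)
group_means = np.array([52.0, 48.0, 60.0])
group_n = np.array([200, 30, 5])
pop_mean = np.average(group_means, weights=group_n)

# Shrinkage weight: small groups are pulled harder toward the population mean.
# tau controls how "distinct yet related" the groups are (assumed value).
tau = 20.0
w = group_n / (group_n + tau)
shrunk = w * group_means + (1 - w) * pop_mean
print(shrunk)  # the small group (n=5) moves the most toward pop_mean
```

A full multilevel Bayesian model estimates the amount of pooling from the data instead of fixing tau by hand.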
Hussein, K. I.; Chan, L.; Van Vleck, T.; Beers, K.; Mindt, M. R.; Wolf, M.; Curtis, L. M.; Agarwal, P.; Wisnivesky, J.; Nadkarni, G. N.; Federman, A.
INTRODUCTION: Early detection of patients with cognitive impairment may facilitate care for individuals in this population. Natural language processing (NLP) is a potential approach to identifying patients with cognitive impairment from electronic health records (EHR). METHODS: We used three machine learning algorithms (logistic regression, multilayer perceptron, and random forest) with clinical terms extracted by NLP to predict cognitive impairment in a cohort of 199 patients. Cognitive impairment was defined as a Mini-Mental State Examination (MMSE) score <24. RESULTS: NLP identified 69 (35%) patients with cognitive impairment and ICD codes identified 44 (22%). Using the MMSE as a reference standard, NLP sensitivity was 35%, specificity 66%, precision 41%, and NPV 61%. The random forest method had the best test parameters: sensitivity 95%, specificity 100%, precision 100%, and NPV 97%. DISCUSSION: NLP can identify adults with cognitive impairment with moderate test performance that is enhanced with machine learning.
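The test characteristics quoted above all derive from the four confusion-matrix counts. A minimal sketch with hypothetical counts (not the study's data):

```python
def screening_metrics(tp, fp, fn, tn):
    """Standard test characteristics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among all actually impaired
        "specificity": tn / (tn + fp),   # true negatives among all actually unimpaired
        "precision": tp / (tp + fp),     # a.k.a. positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Hypothetical counts for a small screening cohort (illustrative, not the paper's)
m = screening_metrics(tp=24, fp=35, fn=45, tn=95)
print(m)
```

The same four counts also give accuracy, (tp + tn) / total, which is why a full confusion matrix is more informative than any single summary number.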
Luo, J.; Ma, J.; Wang, T.; Qiu, Y.; Yang, Y.; Qiu, H.; Chen, G.; Wang, W.
Background & Aims: Hepatocellular carcinoma is the most common type of primary liver cancer and remains a major global health challenge. In resource-limited settings, patients often face barriers such as low screening rates, poor adherence, and limited access to medical information. Despite comprehensive clinical guidelines, issues like inadequate patient education and ineffective communication persist. While large language models show promise in clinical communication and decision support, their performance in hepatocellular carcinoma management has not been systematically evaluated across multiple dimensions. Methods: Ten emerging language models, including general-purpose and medical-domain models, were assessed under prompted and unprompted conditions using a standardized question set covering five key stages: general knowledge, screening, diagnosis, treatment, and follow-up. Accuracy was rated by experts, while semantic consistency, local interpretability, information entropy, and readability were measured computationally. Results: ChatGPT-4o and Grok-3 achieved the highest accuracy (2.62 ± 0.06, 93%; 2.60 ± 0.06, 95%) and interpretability (0.43; 0.43). Prompting significantly improved accuracy (p < 0.001) and interpretability (p < 0.001) across all models. Semantic consistency declined slightly in most models; information entropy generally increased; readability changes varied. Conclusions: This study presents the first multidimensional evaluation of large language models in hepatocellular carcinoma-related clinical tasks. General-purpose models outperformed some medical models, revealing limitations in domain-specific fine-tuning. Prompt design strongly influenced model performance. Further research should integrate diverse prompt strategies and clinical scenarios to improve the usability of language models in real-world oncology settings. 
Lay summary: This study evaluated how well advanced language-based artificial intelligence models can answer clinical questions related to hepatocellular carcinoma. The results showed that some models, especially when guided with structured instructions, provided accurate and understandable responses. These findings suggest that such tools may help improve communication and access to information for both doctors and patients managing liver cancer.
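One of the computational measures above, information entropy, can be sketched as Shannon entropy over a response's word distribution. The function below is one simple proxy; the study's exact formulation is not specified here:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits) of the word distribution of a response.
    Higher values indicate a more varied, less repetitive vocabulary."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy response: "screening" appears twice, so the distribution is not uniform
print(shannon_entropy("screening screening is key"))  # 1.5 bits
```

A uniform distribution over k distinct words gives the maximum, log2(k) bits.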
Lu, Y.; Srinivasan, G.; Preum, S.; Pettus, J.; Davis, M.; Greenburg, J.; Vaickus, L.; Levy, J.
Deep learning (DL) algorithms continue to develop at a rapid pace, providing researchers access to a set of tools capable of solving a wide array of biomedical challenges. While this progress is promising, it also leads to confusion regarding task-specific model choices, where deeper investigation is necessary to determine the optimal model configuration. Natural language processing (NLP) has the unique ability to accurately and efficiently capture a patient's narrative, which can improve the operational efficiency of modern pathology laboratories through advanced computational solutions that can facilitate rapid access to and reporting of histological and molecular findings. In this study, we use pathology reports from a large academic medical system to assess the generalizability and potential real-world applicability of various deep learning-based NLP models on reports with highly specialized vocabulary and complex reporting structures. The performance of each NLP model examined was compared across four distinct tasks: 1) current procedural terminology (CPT) code classification, 2) pathologist classification, 3) report sign-out time regression, and 4) report text generation, under the hypothesis that models initialized on domain-relevant medical text would perform better than models not attuned to this prior knowledge. Our study highlights that the performance of deep learning-based NLP models can vary meaningfully across pathology-related tasks. Models pretrained on medical data outperform other models where medical domain knowledge is crucial, e.g., CPT code classification. However, where interpretation is more subjective (i.e., teasing apart pathologist-specific lexicon and variable sign-out times), models with medical pretraining do not consistently outperform the other approaches. Instead, fine-tuning models pretrained on general or unrelated text sources achieved comparable or better results. 
Overall, our findings underscore the importance of considering the nature of the task at hand when selecting a pretraining strategy for NLP models in pathology. The optimal approach may vary depending on the specific requirements and nuances of the task, and related text sources can offer valuable insights and improve performance in certain cases, contradicting established notions about domain adaptation. This research contributes to our understanding of pretraining strategies for large language models and further informs the development and deployment of these models in pathology-related applications.
Nguyen, H. P.
In the field of pattern recognition, achieving high accuracy is essential. While training a model to recognize different complex images, it is vital to fine-tune the model to achieve the highest accuracy possible. One strategy for fine-tuning a model involves changing its activation function. Most pre-trained models use ReLU as their default activation function, but switching to a different activation function such as Hard-Swish could be beneficial. This study evaluates the performance of models using ReLU, Swish and Hard-Swish activation functions across diverse image datasets. Our results show a 2.06% increase in accuracy for models on the CIFAR-10 dataset and a 0.30% increase in accuracy for models on the ATLAS dataset. Modifying the activation functions in the architecture of pre-trained models leads to improved overall accuracy.
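The three activation functions compared in this study are easy to state directly. A minimal NumPy sketch of ReLU, Swish, and Hard-Swish, the last being the piecewise-linear approximation x * ReLU6(x + 3) / 6:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def hard_swish(x):
    # piecewise-linear approximation of Swish; cheaper on mobile hardware
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

x = np.array([-4.0, -1.0, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(hard_swish(x))  # [-0.     -0.3333  0.      1.6667] approximately
```

Unlike ReLU, both Swish variants are non-monotonic and pass small negative values through, which is the property often credited for their accuracy gains.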
Duman, A.; Sun, X.; Powell, J. R.; Spezi, E.
Purpose: In this study, we develop and validate an interpretable machine learning (ML) model that integrates a hybrid Swarm Intelligence (SI)-based feature selection method with Magnetic Resonance Imaging (MRI)-derived radiomic features (RFs) to estimate overall survival (OS) in Glioblastoma Multiforme (GBM) patients. This study seeks to enhance the generalizability of the developed prognostic model and its potential for clinical integration by emphasizing feature reproducibility and leveraging multi-institutional retrospective datasets. Methods: A cohort of 276 GBM patients with open-access pre-treatment MRI data (including T1, T1ce, T2, and FLAIR sequences) was used to perform comprehensive radiomic analysis. The extraction protocol yielded 1980 RFs per patient, extracted from three tumor regions (enhancing tumor: ET, tumor core: TC, and whole tumor: WT). The prognostic framework was built step-by-step, starting with a model of up to 10 RFs and then improving prediction by adding a single clinical feature (Age). In the training (discovery) dataset, we employed five-fold cross-validation combined with bootstrapping to ensure robust methodological validation. Model evaluation covered the C-index with 95% confidence intervals (CI) and survival stratification using Kaplan-Meier curves and the log-rank test to separate patients into low- and high-risk groups for OS. Results: The final survival model integrates patient age and ten independent RFs; the model itself was optimized using features derived from three tumor contours and two MRI sequences (T1, FLAIR). The model's performance in the holdout test dataset was evaluated by a concordance index (C-index) of 0.71 (95% CI: 0.61-0.79), exhibiting statistically significant risk stratification (p = 2 x 10-). Upon external validation, the model achieved a C-index of 0.64, maintaining statistical significance (p = 1 x 10^-2). 
The research combined regularized Cox regression (Cox-LASSO), a traditional ML model, with a new SI-based LASSO-PSO method, yielding significant stratification. To our knowledge, the present study offers the first documented use of an interpretable ML model with an SI-based approach (LASSO-PSO) for successful risk stratification based on OS. Conclusion: This study provides the development and validation of a clinical-radiomic model capable of conducting time-to-event analysis in GBM patients. By leveraging multicenter retrospective datasets, the model enables effective risk stratification based on OS. A key direction for future work involves exploring the combination of deep learning (DL)-based features and engineered features extracted via standardized convolutional filters, with the objective of improving OS prediction.
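The C-index used above measures how often the model ranks pairs of patients in the correct survival order. A simplified sketch of Harrell's concordance index, ignoring tied event times and using hypothetical data:

```python
from itertools import combinations

def c_index(times, events, risks):
    """Harrell's concordance index: among usable pairs, the fraction where
    the patient with the shorter survival time has the higher predicted risk.
    Simplified: assumes no tied event times."""
    conc, usable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:   # reorder so i has the shorter observed time
            i, j = j, i
        if not events[i]:         # pair unusable if the earlier time is censored
            continue
        usable += 1
        if risks[i] > risks[j]:
            conc += 1.0
        elif risks[i] == risks[j]:
            conc += 0.5           # ties in predicted risk count as half
    return conc / usable

# Hypothetical survival times (months), event flags (1=death, 0=censored), risks
t = [5, 12, 20, 7]
e = [1, 1, 0, 1]
r = [0.9, 0.4, 0.1, 0.2]
print(c_index(t, e, r))  # 5 of 6 usable pairs concordant ≈ 0.833
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.71 and 0.64 in context.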
Adegbosin, O. T.; Patel, H.
Background: Microsatellite stability status determination is important for prognostication and therapeutic decision making in colorectal cancer management, but the conventional methods for this assessment are not readily available, especially in low- and middle-income countries. Deep learning (DL) models have been proposed for addressing this problem; however, potential computational cost due to model complexity and inadequate explainability may limit their adoption in low-resource settings. This study explored the potential of explainable lightweight models for detection of microsatellite instability in colorectal cancer. Methods: DL models were trained using a public dataset of colorectal cancer histology images and then used to classify a set of test images into one of two classes: microsatellite instability or microsatellite stability. The models were compared for efficiency. Gradient-weighted class activation mapping (Grad-CAM) was used to interpret the models' decision-making. Results: The simpler convolutional neural network (CNN) trained from scratch had modest performance (accuracy=0.757, area under receiver-operating characteristic curve [AUROC]=0.840). With an attention mechanism added, these values increased, but specificity and sensitivity decreased. Pretrained models performed better than the ones trained from scratch, and EfficientNet_B0 had the best balance of high performance and low computational requirements (accuracy=0.936, AUROC=0.990, negative predictive value=0.923, specificity=0.953, 4,010,000 trainable parameters, 0.38 gigaFLOPs). However, a simple CNN model with an attention mechanism had the best interpretability based on Grad-CAM. Conclusion: This study demonstrated that DL models that are lightweight compared to previously proposed ones can be useful for colorectal cancer microsatellite instability screening in resource-limited settings while balancing performance and computational efficiency.
Levy, J. J.; Jackson, C. R.; Sriharan, A.; Christensen, B.; Vaickus, L. J.
Evaluation of a tissue biopsy is often required for the diagnosis and prognostic staging of a disease. Recent efforts have sought to accurately quantitate the distribution of tissue features and morphology in digitized images of histological tissue sections, Whole Slide Images (WSI). Generative modeling techniques present a unique opportunity to produce training data that can both augment these models and translate histologic data across different intra- and inter-institutional processing procedures, provide cost-effective ways to perform computational chemical stains (synthetic stains) on tissue, and facilitate the creation of diagnostic aid algorithms. A critical evaluation and understanding of these technologies is vital for their incorporation into a clinical workflow. We illustrate several potential use cases of these techniques for the calculation of nuclear to cytoplasm ratio, synthetic SOX10 immunohistochemistry (IHC, sIHC) staining to delineate cell lineage, and the conversion of hematoxylin and eosin (H&E) stain to trichrome stain for the staging of liver fibrosis.
Mondal, A.; Karad, R. K.; Bhattacharjee, B.; Saha, B.
Background: Artificial Intelligence (AI) has the potential to transform healthcare, including the field of infectious disease diagnostics. This study assesses the capability of three large language models (LLMs), GPT 4, Llama 3, and Gemini 1.5, to generate differential diagnoses, comparing their outputs against those of medical experts to evaluate AI's potential in augmenting clinical decision-making. Methods: This study evaluates the differential diagnosis capabilities of three LLMs, GPT 4, Llama 3, and Gemini 1.5, using 50 simulated infectious disease cases. The cases were diverse, complex, and reflective of common clinical scenarios, including detailed histories, symptoms, lab results, and imaging findings. Each model received standardized case information and produced differential diagnoses, which were then compared to reference differential diagnosis lists created by medical experts. The analysis utilized the Jaccard index and Kendall's tau to assess similarity and order accuracy, summarizing findings with mean, standard deviation, and combined p-values. Results: The mean numbers of differential diagnoses generated by GPT 4, Llama 3, and Gemini 1.5 were 6.22, 5.06, and 10.02 respectively, which was significantly different (p<0.001) from the medical experts. The mean Jaccard indices of GPT 4, Llama 3, and Gemini 1.5 were 0.3, 0.21, and 0.24, while the mean Kendall's tau values were 0.4, 0.7, and 0.33 respectively. The combined p-values of GPT 4, Llama 3, and Gemini 1.5 were 1, 1, and 0.979 respectively, indicating no significant association between the differential diagnoses generated by the LLMs and the medical experts. Conclusion: Although LLMs like GPT 4, Llama 3, and Gemini 1.5 exhibit varying effectiveness, none align significantly with expert-level diagnostic accuracy, emphasizing the need for further development and refinement. 
The findings highlight the importance of rigorous validation, ethical considerations, and seamless integration into clinical workflows to ensure AI tools enhance healthcare delivery and patient outcomes effectively.
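The two agreement measures used above can be sketched on toy diagnosis lists. This simplified version computes the Jaccard index over the sets and Kendall's tau-a over the ordering of the shared items only; the diagnosis names and the handling of unshared items are illustrative assumptions, not the study's exact protocol:

```python
def jaccard(a, b):
    """Set overlap between two diagnosis lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kendall_tau(order_a, order_b):
    """Kendall's tau-a over the items that appear in both ranked lists."""
    shared = [d for d in order_a if d in order_b]
    n = len(shared)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            # shared[i] precedes shared[j] in order_a; check agreement in order_b
            if order_b.index(shared[i]) < order_b.index(shared[j]):
                conc += 1
            else:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

# Hypothetical differential-diagnosis lists (expert reference vs. model output)
expert = ["dengue", "malaria", "typhoid", "leptospirosis"]
model = ["malaria", "dengue", "typhoid", "influenza"]
print(jaccard(expert, model))      # 3 shared / 5 total = 0.6
print(kendall_tau(expert, model))  # one swapped pair among three → 1/3
```

Jaccard captures *which* diagnoses overlap; tau captures whether the shared ones are ranked in the same order.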
Wakiya, T.; Sanada, Y.; Okada, N.; Hirata, Y.; Horiuchi, T.; Omameuda, T.; Onishi, Y.; Sakuma, Y.; Yamaguchi, H.; Sasaki, Y.; Sata, N.
Background: Massive intraoperative blood loss (IBL) in liver transplantation (LT) poses serious risks and strains healthcare resources, necessitating better predictive models for risk stratification. As traditional models often fail to capture the complex, non-linear patterns underlying bleeding risk, this study aimed to develop data-driven machine learning models for predicting massive IBL during LT using preoperative factors. Methods: Two hundred ninety consecutive LT cases from a prospective database were analyzed. Logistic regression models were built using 73 preoperative demographic and laboratory variables to predict massive IBL (≥ 80 mL/kg). The dataset was randomly split (70% training, 30% testing). The model was trained and validated through three-fold cross-validation, with backward stepwise feature selection iterated 100 times across unique random splits. The final model, based on a high stability index, was evaluated using the area under the curve (AUC). Results: Massive IBL was observed in 141 patients (48.6%). In standard logistic regression, significant differences were found in 42 of 73 factors between groups stratified by massive IBL; however, substantial multicollinearity limited interpretability. In the feature selection across 100 iterations, the data-driven model achieved an average AUC of 0.840 in the validation and 0.738 in the test datasets. The final model, based on 11 selected features with a high stability index, achieved an AUC of 0.844. An easy-to-use online risk calculator for massive IBL was developed and is available at: https://tai1wakiya.shinyapps.io/ldlt_bleeding_ml/. Conclusions: Our findings highlight the potential of machine learning in capturing complex risk factor interactions for predicting massive IBL in LT.
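The stability index behind the final model can be illustrated as the selection frequency of each feature across repeated stepwise runs. The feature names and the threshold below are hypothetical, not the paper's variables:

```python
from collections import Counter

def stability_index(selected_sets):
    """Fraction of runs in which each feature was selected."""
    counts = Counter(f for s in selected_sets for f in s)
    n = len(selected_sets)
    return {f: c / n for f, c in counts.items()}

# Hypothetical feature sets chosen by stepwise selection on different random splits
runs = [
    {"platelets", "MELD", "hemoglobin"},
    {"platelets", "MELD", "fibrinogen"},
    {"platelets", "MELD", "hemoglobin"},
    {"platelets", "creatinine"},
]
stab = stability_index(runs)
print(stab["platelets"], stab["MELD"])  # 1.0 0.75

# Keep only features whose selection frequency exceeds a chosen threshold
stable = {f for f, s in stab.items() if s >= 0.75}
print(stable)  # {'platelets', 'MELD'}
```

Features that survive most random splits are less likely to be artifacts of one particular train/test partition, which is the rationale for building the final model on high-stability features.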
Bou Malham, V.; Leandre, F.; Hamimi, A.; Lagoutte, I.; Bouchet, S.; Gougelet, A.; Colnot, S.; Desbois-Mouthon, C.
Background & aims: Constitutive activation of the β-catenin pathway is a determining feature in the pathogenesis of two primary liver cancers, namely HCC and hepatoblastoma (HB). Activating alterations in the CTNNB1 gene and, to a lesser extent, inhibiting alterations in the APC gene are observed in 30 to 40% of HCC cases and 80 to 90% of HB cases. For both tumours, therapeutic management is far from optimal. Therefore, relevant experimental models are needed to increase our knowledge and test new therapeutic approaches. Methods: Organoids and tumouroids were established from APCΔhep and βcatΔex3 mouse models, which are clinically relevant models for β-catenin-activated HCC and mesenchymal HB. We developed a new methodological approach based on a dynamic suspension culture in a rotating bioreactor. Morphological and molecular characteristics and sensitivity to WNTinib, a treatment already successfully tested on human HCC and HB tumouroids, were evaluated by histology, immunohistochemistry, immunofluorescence, and RT-qPCR. Results: This easy-to-implement methodology allows for the rapid generation of a large number of organoids and tumouroids that are uniform in size and show no signs of cell death in their core. The robustness of the methodology is illustrated by the maintenance of the histological architecture, cell diversity and gene expression in organoids and tumouroids in comparison with the native liver tissues. In addition, the value of the HCC-derived tumouroids for evaluating cancer treatment was assessed based on their responsiveness to the β-catenin antagonist WNTinib. Conclusions: The organoids and tumouroids that we present here are new reliable in vitro cancer models, recapitulating the main features of β-catenin-driven HCC and mesenchymal HB. They can be integrated into an appropriate platform for drug screening and could enable the development of "à la carte" therapies that are urgently needed for these indications. 
Impact and implications: This study addresses the critical need for representative in vitro models to investigate β-catenin-driven liver cancers. The organoids and tumouroids developed here are particularly valuable for researchers seeking robust, reproducible models that accurately reflect the cellular diversity and gene expression profiles of native liver tumours. These findings have practical applications in exploring cancer mechanisms, screening new drugs, optimizing personalized treatment strategies, and reducing reliance on animal models, which ultimately benefits patients. Highlights: Easy and rapid generation of mouse liver organoids and tumouroids from β-catenin-activated tumours using culture in a bioreactor. Tumouroids preserve histology, cell diversity, and gene expression of native tissue. HCC-derived tumouroids respond to the β-catenin inhibitor WNTinib. These reliable 3D models reduce reliance on animal experiments for drug testing.
Bolut, C.; Pacary, A.; Pieruccioni, L.; Ousset, M.; Paupert, J.; Casteilla, L.; Simoncini, D.
Machine learning (ML) models are effective at classifying images across various fields, including biology. However, their performance on biomedical images is often limited by the small size of available datasets, which are constrained by the time-consuming and costly nature of experimental data collection. A review of the literature shows that many studies using biomedical images fail to follow ML best practices. This study focuses on regenerative medicine, which aims to promote tissue regeneration rather than scarring. To explore this process, we applied ML to a limited dataset of images of mouse tissues, aiming to distinguish between regenerating and scarring samples. As expected, binary classification failed to generalize to independent data. A novel SHAP-based analysis revealed that the overfitted models relied on spurious correlations, including individual mouse characteristics that aligned with the regeneration/scarring labels. The models appeared to be solving the binary classification task, but were in fact recognizing individuals. To investigate this behavior further, we examined the test-set confusion matrix of a model trained to identify individual mice. We observed that, beyond individual recognition, individuals were grouped according to the time elapsed after injury (day 3 or 10) and the healing outcome (regeneration or scarring). We hypothesized that these groupings were based on relevant biological information captured by the model. To test this hypothesis, we successfully trained a model to classify images according to the time elapsed after injury (3 or 10 days), demonstrating that ML can extract relevant biological information when the task is aligned with what the data can actually support. Altogether, this study demonstrates that carefully examining a model's explanations is an effective way not only to unveil putative biases but also to extract relevant information from a limited dataset.
Author summary: Machine learning is increasingly used to analyze biomedical images, but in many experimental settings only small datasets are available, which can easily mislead powerful models. In this study, we analyzed images of mouse tissues, aiming to distinguish healing by regeneration from healing by scarring. Although standard machine learning models appeared to perform well during training, they failed to generalize to new animals. By carefully analyzing model explanations, we found that the models were not learning biologically meaningful patterns of tissue repair, but instead were recognizing individual mice based on subtle image-specific signatures. Importantly, this same analysis revealed that the models did capture relevant biological information when the task was better aligned with the data, such as distinguishing early versus late stages of healing. Our results highlight how explanation methods can uncover hidden biases, prevent false conclusions, and help researchers extract meaningful biological insights even from limited and imperfect datasets.
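The individual-recognition confound described above can also be checked without any explanation tooling, by comparing a naive image-level split with one that holds out whole animals. The sketch below (scikit-learn; the synthetic per-mouse "signature" features and the label layout are illustrative assumptions, not the study's data) shows how identity leakage inflates naive cross-validation while animal-grouped cross-validation falls back toward chance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_mice, imgs_per_mouse = 10, 20
# Each mouse has a fixed "signature" feature vector; the outcome label is
# perfectly aligned with the individual, mimicking the confound in the text.
mouse_ids = np.repeat(np.arange(n_mice), imgs_per_mouse)
signatures = np.repeat(rng.normal(size=(n_mice, 5)), imgs_per_mouse, axis=0)
X = signatures + 0.1 * rng.normal(size=signatures.shape)
y = mouse_ids % 2  # "regeneration" vs "scarring", confounded with identity

clf = RandomForestClassifier(n_estimators=50, random_state=0)
naive = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean()
grouped = cross_val_score(clf, X, y, cv=GroupKFold(5), groups=mouse_ids).mean()
print(f"naive CV accuracy:   {naive:.2f}")    # inflated by identity leakage
print(f"grouped CV accuracy: {grouped:.2f}")  # near chance: signatures carry no biology
```

GroupKFold plays the role of the independent-animal test set; a large gap between the two scores is exactly the warning sign of the leakage the authors describe.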
Lohani, A.; Mishra, B. K.; Wertheim, K. Y.; Fagbola, T. M.
In recent years, different Convolutional Neural Network (CNN) approaches have been applied to image classification in general and to specific problems such as breast cancer diagnosis, but there is no standardised approach to facilitate comparison and synergy. This paper attempts a step-by-step approach to standardising a common application of image classification, using the specific problem of classifying breast ultrasound images for breast cancer diagnosis as an illustrative example. In this study, three distinct datasets, the Breast Ultrasound Image (BUSI), Breast Ultrasound Image (BUI), and Ultrasound Breast Images for Breast Cancer (UBIBC) datasets, were used to build and fine-tune custom and pre-trained CNN models systematically. Custom CNN models were built, and transfer learning (TL) was then applied to deploy a broad range of pre-trained models, optimised by applying data augmentation techniques and hyperparameter tuning. Models were trained and tested in scenarios involving limited and large datasets to gain insights into their robustness and generality. The obtained results indicated that the custom CNN and VGG19 are the two most suitable architectures for this problem. The experimental results highlight the significance of employing an effective step-by-step approach in image classification tasks to enhance the robustness and generalisation capabilities of CNN-based classifiers.
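The transfer-learning-plus-augmentation recipe described above can be sketched in Keras. This is a minimal illustration under stated assumptions, not the authors' pipeline: the three-class output (normal/benign/malignant) and the head architecture are guesses, and weights=None keeps the sketch offline (pass weights="imagenet" in practice to get the pre-trained backbone).

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

# Frozen VGG19 backbone; weights=None keeps this sketch offline --
# use weights="imagenet" in practice for actual transfer learning.
base = VGG19(include_top=False, weights=None, input_shape=(224, 224, 3))
base.trainable = False

# Augmentation layers plus a small classification head. The 3-class
# label set (normal / benign / malignant) is an assumption.
model = models.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Fine-tuning would then unfreeze the top few backbone blocks at a lower learning rate, which is the usual second stage of the TL recipe the paper benchmarks.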
Tan, Z. Q.; Roscoe, M. G.; Addison, O.; Li, Y.
Background: Deep learning has achieved rapid development in recent years and has been applied to various fields in dentistry. While cross-disciplinary research between artificial intelligence and dentistry is growing exponentially, most studies rely on off-the-shelf machine learning models, with only a small portion introducing technological novelty. Furthermore, tasks such as dental disease diagnosis are inherently complex, with high intra- and inter-observer variability: dentists often interpret radiographs differently and offer varying subsequent treatments. However, many studies have overlooked this variability, assuming no data or model uncertainty in dental tasks. Additionally, many evaluated their methods using private and small-scale datasets, making fair comparisons of their outcome metrics challenging and introducing significant predictive bias in artificial intelligence models. The goal of the current study was to examine and critically assess recent novel advances in artificial intelligence in dentistry across a wide range of dental applications. Methods: We begin by presenting foundational concepts in artificial intelligence and adopt a unique approach by focusing on the novelty of deep learning methods. Following that, we conducted a systematic review by searching online databases (PubMed, IEEE Xplore, arXiv, and Google Scholar) for publications related to artificial intelligence, machine learning, and deep learning applications in dentistry. Results: A total of 91 articles met the inclusion criteria, and we present a comprehensive analysis of the studies. Moreover, we discuss the limitations of recent studies on artificial intelligence in dentistry and identify key research opportunities for progress and innovation. These include integrating dental domain knowledge, quantifying uncertainty, leveraging large models and multiple sources of datasets, developing efficient deep learning pipelines, and conducting thorough evaluations in both simulated and real-world experimental settings. Conclusion: Recent advancements in deep learning demonstrate great potential for dentistry applications. However, future research addressing the limitations of recent studies is needed to fully realize this potential, helping dental professionals utilize AI effectively and improving clinical and patient outcomes in dentistry.
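One of the research opportunities named above, quantifying uncertainty, can be illustrated with a minimal Monte-Carlo-style sketch: aggregate several stochastic predictions (e.g. from dropout kept active at test time) and report the entropy of the mean class distribution. The probability vectors below are synthetic placeholders, not outputs of any dental model.

```python
import numpy as np

def predictive_entropy(prob_samples):
    """Entropy of the mean class distribution over stochastic forward passes.

    prob_samples: array of shape (n_passes, n_classes), each row one softmax
    output (e.g. from a forward pass with dropout enabled at test time).
    """
    mean_p = prob_samples.mean(axis=0)
    return float(-(mean_p * np.log(mean_p + 1e-12)).sum())

# Synthetic example: a confident prediction vs. a disagreeing ensemble.
confident = np.array([[0.97, 0.02, 0.01]] * 10)
uncertain = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]] * 5)
print(predictive_entropy(confident))  # low: passes agree
print(predictive_entropy(uncertain))  # high: passes disagree
```

A high entropy would flag exactly the radiographs where the inter-observer variability discussed above makes a single hard label unreliable.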
de Sousa, E. M. V.; Kumar, A.; Coupland, C.; Vaz, T. F.; Jones, W.; Valcarce-Dineiro, R.; Calaminus, S. D. J.
Manual counting of platelets in microscopy images is highly time-consuming. Our goal was to automatically segment and count platelets in images using a deep learning approach, applying U-Net and Fully Convolutional Network (FCN) models. Data preprocessing was done by creating binary masks and utilizing supervised learning with ground-truth labels. Data augmentation was implemented for improved model robustness and detection. The number of detected regions was then retrieved as a count. The study investigated the U-Net model's performance with different datasets, indicating notable improvements in segmentation metrics as the dataset size increased, while FCN performance was only evaluated on the smaller dataset and abandoned due to poor results. U-Net surpassed FCN in both detection and counting measures on the smaller dataset (Dice 0.90, accuracy 0.96 for U-Net vs Dice 0.60, accuracy 0.81 for FCN). When tested on a bigger dataset, U-Net produced even better values (Dice 0.99, accuracy 0.98). The U-Net model proves particularly effective as the dataset size increases, showcasing its versatility and accuracy in handling varying cell sizes and appearances. These data show potential areas for further improvement and the promising application of deep learning in automating cell segmentation for diverse life science research applications. Author summary: Deep learning can be used with good results for automatic segmentation of cell images, reducing the time scientists spend on this task. In our research, platelet images were automatically segmented and counted by applying U-Net and Fully Convolutional Network (FCN) models. Data preprocessing was done by creating binary masks and utilizing supervised learning with ground-truth labels, after data augmentation. U-Net surpassed FCN in both detection and counting measures on a smaller dataset. The U-Net model proves particularly effective as the dataset size increases, showcasing its versatility and accuracy in handling varying cell sizes and appearances. Our study shows potential areas for further improvement and the promising application of deep learning in automating cell segmentation for diverse life science research applications.
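The "retrieve detected regions as a count" step described above is typically a connected-component pass over the predicted binary mask, and Dice is the overlap score the study reports. A minimal sketch with SciPy (the mask here is synthetic, standing in for a U-Net output, and the `min_size` speckle filter is an illustrative assumption):

```python
import numpy as np
from scipy import ndimage

def count_objects(binary_mask, min_size=3):
    """Count connected foreground regions, dropping tiny speckle artifacts."""
    labeled, n = ndimage.label(binary_mask)
    sizes = ndimage.sum(binary_mask, labeled, range(1, n + 1))
    return int((sizes >= min_size).sum())

def dice(pred, truth):
    """Dice overlap between two binary masks, as used to score the models."""
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

# Synthetic mask with three well-separated "platelets".
mask = np.zeros((20, 20), dtype=bool)
mask[2:5, 2:5] = mask[10:14, 3:6] = mask[6:9, 14:18] = True
print(count_objects(mask))         # 3
print(round(dice(mask, mask), 2))  # 1.0
```

Because touching platelets merge into one component, real pipelines often add a watershed split before counting; the sketch keeps only the basic labeling step.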
Thomas, M. G.; Mastorides, S. M.; Borkowski, S. A.; Reed, J. L.; Deland, L. A.; Thomas, L. B.; Borkowski, A. A.
The role of artificial intelligence (AI) in health care delivery is growing rapidly. Due to its visual nature, the specialty of anatomic pathology holds great promise for applications of AI. We examine the potential of six different AI models for differentiating and diagnosing the three most common primary liver tumors: hepatocellular carcinoma (HCC), cholangiocarcinoma (CCA), and combined HCC and CCA (cHCC/CCA). Our results demonstrated that for all three diagnoses, the sensitivity, specificity, positive predictive value, and negative predictive value were ≥ 94% in the best model tested, with results ≥ 92% in all categories in three of the models. These values are comparable to interpretation by general pathologists alone and demonstrate AI's potential in interpreting patient specimens for primary liver carcinoma. Applications such as these have multiple implications for delivering quality patient care, including assisting with intraoperative consultations and providing a rapid "second opinion" for confirmation and increased accuracy of final diagnoses. These applications may be particularly useful in underserved areas with shortages of subspecialized pathologists or after hours in larger medical centers. In addition, AI models such as these can decrease turnaround times and the inter- and intra-observer variability well documented in pathologic diagnoses. AI offers great potential in assisting pathologists in their day-to-day practice.
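The four metrics reported above are all derived from one-vs-rest confusion-matrix counts. A short sketch (the counts are illustrative, not taken from the study):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """One-vs-rest diagnostic metrics (e.g. HCC vs. not-HCC) from raw counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Illustrative counts only -- not from the study.
m = diagnostic_metrics(tp=47, fp=2, fn=3, tn=48)
print({k: round(v, 3) for k, v in m.items()})
```

Unlike sensitivity and specificity, PPV and NPV depend on the class balance of the evaluation set, which is one reason multi-class pathology results are usually reported per diagnosis as here.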
Bonde, M.; Bonde, A.; Kaafarani, H.; Sillesen, M.; Millarch, A.
Introduction: Pancreaticoduodenectomy (PD) for patients with pancreatic ductal adenocarcinoma (PDAC) is associated with a high risk of postoperative complications (PoCs), and risk prediction of these is therefore critical for optimal treatment planning. We hypothesize that novel deep learning approaches using transfer learning may be superior to legacy approaches for PoC risk prediction in the PDAC surgical setting. Methods: Data from the US National Surgical Quality Improvement Program (NSQIP) 2002-2018 were used, with a total of 5,881,881 patients, including 31,728 PD patients. Modelling approaches comprised a model trained on a general surgery patient cohort and then tested on a PD-specific cohort (general model); a transfer learning model trained on the general surgery patients with subsequent transfer and retraining on a PD-specific patient cohort (transfer learning model); a model trained and tested exclusively on the PD-specific patient cohort (direct model); and a benchmark random forest model trained on the PD patient cohort (RF model). The models were then compared against the American College of Surgeons (ACS) surgical risk calculator (SRC) in terms of predicting mortality and morbidity risk. Results: The general model and the transfer learning model outperformed the RF model in 14 and 16 out of 19 prediction tasks, respectively. Additionally, both models outperformed the direct model on 17 of the 19 tasks. The transfer learning model also outperformed the general model on 11 of the 19 prediction tasks. The transfer learning model outperformed the ACS-SRC on mortality, and all models outperformed the ACS-SRC on morbidity prediction, with the general model achieving the highest receiver operating characteristic area under the curve (ROC AUC) of 0.668 compared with 0.524 for the ACS-SRC. Conclusion: Deep neural networks (DNNs) deployed using a transfer learning approach may be of value for PoC risk prediction in the PD setting.